
Record Submission: 1.0541 BPB - 5-expert Hedge Mixer + CROWN-Q + stride=64 #700

Open
RoyiRa wants to merge 1 commit into openai:main from RoyiRa:submission/2026-03-25-hedge-mixer-crown-q

Conversation


@RoyiRa RoyiRa commented Mar 25, 2026

Record: 5-expert Hedge Mixer + CROWN-Q + stride=64 (val_bpb=1.0541)

val_bpb: 1.0541 (3-seed mean) | ~15.7 MB | 8xH100 SXM

Results (8xH100 80GB SXM)

| Seed | step_avg | Steps | Pre-TTT bpb | Post-TTT bpb | TTT gain | Eval time | Artifact |
|------|----------|-------|-------------|--------------|----------|-----------|----------|
| 1337 | 98.1 ms | 5,935 | 1.1251 | 1.0473 | -0.0778 | 336 s | 15.89 MB |
| 42 | 97.9 ms | 5,947 | 1.1264 | 1.0686 | -0.0578 | 336 s | 15.69 MB |
| 7 | 98.0 ms | 5,940 | 1.1246 | 1.0465 | -0.0781 | 336 s | 15.66 MB |
| Mean | | | 1.1254 | 1.0541 | -0.0713 | 336 s | ~15.75 MB |

Contributions

1. CROWN-Q Training Penalty (training-time)

Added a quantization-aware penalty during warmdown that discourages weights sensitive to quantization error:

crown_q_loss = lambda * mean(w^2 * delta^2 / 12)

where delta = row_max / clip_range is the per-row quantization step size. This encourages weights to be quantization-friendly, reducing post-quantization degradation. CROWN_Q_LAMBDA=0.01.

Effect: Slightly better compression (artifact ~200KB smaller) and more robust quantization.
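The penalty above can be sketched as follows (a minimal PyTorch illustration, not the submission's actual code; `clip_range` is an assumed symmetric int5 clip of 15 levels, and the delta^2/12 factor is the variance of uniform rounding noise):

```python
import torch

def crown_q_penalty(weight: torch.Tensor, clip_range: float = 15.0,
                    lam: float = 0.01) -> torch.Tensor:
    """CROWN-Q style penalty: lambda * mean(w^2 * delta^2 / 12).

    delta = row_max / clip_range is the per-row quantization step size;
    delta^2 / 12 is the variance of uniform rounding noise, so each
    parameter is weighted by its expected quantization error.
    """
    row_max = weight.abs().amax(dim=1, keepdim=True)  # per-row max magnitude
    delta = row_max / clip_range                      # per-row step size
    return lam * (weight.pow(2) * delta.pow(2) / 12.0).mean()
```

Added to the training loss during warmdown only, it nudges large-magnitude rows (which get coarser quantization steps) toward values that round cleanly.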

2. Eval stride 32 -> 64 (eval-time)

Changed the sliding-window stride from 32 to 64 during evaluation. Experiments showed identical BPB but 2x faster scoring, freeing ~100s of eval budget for more TTT epochs.
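The stride trade-off can be sketched as follows (an assumed generic sliding-window scorer, not the submission's implementation): each forward pass scores `stride` new tokens with up to `window` tokens of left context, so doubling the stride halves the number of forward passes.

```python
import torch

@torch.inference_mode()
def sliding_window_nll(model, tokens, window=2048, stride=64):
    """Mean next-token NLL (nats) via sliding-window evaluation.

    Each forward pass scores `stride` fresh tokens using up to `window`
    tokens of left context; stride 32 -> 64 halves the pass count.
    """
    total_nll, n_scored, pos = 0.0, 0, 1
    while pos < len(tokens):
        end = min(pos + stride, len(tokens))
        start = max(0, end - window)
        ctx = tokens[start:end].unsqueeze(0)          # (1, <=window)
        logits = model(ctx[:, :-1])                   # next-token logits
        logp = torch.log_softmax(logits.float(), dim=-1)
        tgt = ctx[:, 1:]
        nll = -logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1)
        fresh = nll[0, -(end - pos):]                 # only the new positions
        total_nll += fresh.sum().item()
        n_scored += fresh.numel()
        pos = end
    return total_nll / n_scored
```

A larger stride gives each scored token slightly less average context, which is why the observed BPB parity between 32 and 64 had to be verified empirically rather than assumed.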

3. TTT Epochs 3 -> 4 (eval-time)

Increased test-time training from 3 to 4 epochs per chunk, using the time freed by stride=64. Each additional epoch adapts the model further to the scored data. 8 epochs was also tested but overfits (1.0735 vs 1.0473 for 4 epochs).
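The score-first TTT loop this epoch count plugs into can be sketched as follows (a minimal illustration under assumptions; the submission additionally freezes the first two blocks and scores with a Polyak-averaged copy of the weights):

```python
import torch
import torch.nn.functional as F

def ttt_adapt(model, tokens, chunk_tokens=131072, epochs=4, lr=1e-4):
    """Score-first test-time training: every chunk is scored under
    inference_mode() BEFORE the model takes any gradient step on it.
    """
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.0)
    total_nll, n_scored = 0.0, 0
    for start in range(0, len(tokens) - 1, chunk_tokens):
        chunk = tokens[start:start + chunk_tokens + 1].unsqueeze(0)
        x, y = chunk[:, :-1], chunk[:, 1:]
        # Phase 1: score the chunk, forward only -- no update has seen it yet
        with torch.inference_mode():
            out = model(x)
            total_nll += F.cross_entropy(
                out.reshape(-1, out.size(-1)), y.reshape(-1),
                reduction="sum").item()
            n_scored += y.numel()
        # Phase 2: adapt on the already-scored tokens for `epochs` passes
        for _ in range(epochs):
            out = model(x)
            loss = F.cross_entropy(out.reshape(-1, out.size(-1)),
                                   y.reshape(-1))
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
    return total_nll / n_scored          # mean NLL in nats
```

Because scoring strictly precedes adaptation within each chunk, extra epochs only change how much the model has adapted by the time the *next* chunk is scored, which is where the 4-vs-8 epoch overfitting shows up.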

Combined Effect

  • stride=64 saves ~100s of eval time
  • 4th TTT epoch uses ~85s of the saved time
  • Net eval time: ~336s (down from ~562s), well within 600s budget
  • BPB improvement: 1.0745 -> 1.0541 (-0.0204)

Architecture

| Component | Setting |
|-----------|---------|
| Layers | 11 (512d, 8H, 8KV) |
| MLP | 3.5x with LeakyReLU(0.5)^2 |
| BigramHash | 6144 (dim=128) |
| XSA | All 11 layers (ws=8) |
| VE128 | Layers 9-10 |
| Quantization | Full GPTQ int5 + zstd level 22 |
| Pruning | 3% magnitude |
| TTT | AdamW lr=0.0001, 4 epochs, 131K chunks, Polyak 0.998 |
| Mixer | 5-expert Hedge (neural, unigram, bigram, trigram, entropy) |
| Training reserve | 18s (for EMA + calibration + quantization) |
| Early warmdown | LR schedule targets 582s |
| CROWN-Q | lambda=0.01 during warmdown |
| Eval stride | 64 (was 32) |
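The 5-expert Hedge mixer in the table can be sketched as multiplicative-weights updates over per-expert log loss (an assumed generic form of Hedge, not the submission's code; `eta=0.1` matches `MIXER_ETA` in the reproduction command):

```python
import numpy as np

def hedge_mix(expert_probs, targets, eta=0.1):
    """Hedge / multiplicative-weights mixing of expert predictions.

    expert_probs: (T, K, V) array of K experts' next-token distributions
    over T steps; targets: (T,) true token ids. After each step, expert
    k's weight is multiplied by exp(-eta * nll_k), so experts that
    predict well dominate the mixture. (The submission's 5 experts are
    neural, unigram, bigram, trigram, and entropy.)
    """
    T, K, V = expert_probs.shape
    w = np.full(K, 1.0 / K)
    total_nll = 0.0
    for t in range(T):
        mix = w @ expert_probs[t]                      # (V,) mixture
        total_nll += -np.log(mix[targets[t]] + 1e-12)
        losses = -np.log(expert_probs[t, :, targets[t]] + 1e-12)
        w *= np.exp(-eta * losses)                     # Hedge update
        w /= w.sum()
    return total_nll / T                               # mean NLL in nats
```

The mixture's regret against the single best expert is bounded, which is why adding weak n-gram experts is cheap: the mixer quickly down-weights them wherever the neural expert dominates.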

Reproduction

DATA_PATH=../data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=../data/tokenizers/fineweb_1024_bpe.model \
SEED=1337 MAX_WALLCLOCK_SECONDS=600 \
USE_MIXER=1 MIXER_ETA=0.1 \
TTT_EPOCHS=4 TTT_FREEZE_BLOCKS=2 \
TTT_LR=0.0001 TTT_CHUNK_TOKENS=131072 \
ADAPTIVE_LR=1 ADAPTIVE_LR_MAX=3.0 \
EVAL_STRIDE=64 \
CROWN_Q_LAMBDA=0.01 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Compliance

| Constraint | Limit | Actual | Status |
|------------|-------|--------|--------|
| Train time | 600s | 582s | Pass |
| Eval time | 600s | 336s | Pass |
| Artifact size | 16,000,000 bytes | 15,892,040 bytes (worst seed) | Pass |
| No pre-scoring training | | Score-first TTT: each chunk scored under inference_mode() before any training on it | Pass |
| GPTQ calibration in training budget | | Runs within 18s training reserve (1.9s actual) | Pass |

Credits

agalimova added a commit to agalimova/parameter-golf that referenced this pull request Mar 25, 2026
Built on PR openai#700 with hyperparameter improvements found via
autoresearch-multi combinatorial search:
- XSA_LAST_N=6 (extended from 4 to 6 layers)
- BIGRAM_VOCAB_SIZE=4096 (doubled from 2048)

3-seed mean: 1.1078 (std 0.0045)
Seeds: 42=1.1045, 1337=1.1061, 2025=1.1129

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
RoyiRa force-pushed the submission/2026-03-25-hedge-mixer-crown-q branch 2 times, most recently from 30e7835 to 57d1d2c on March 25, 2026 at 14:27
Asukabot0 added a commit to Asukabot0/parameter-golf that referenced this pull request Mar 25, 2026
1. Rewrite ttt_adapt() to score-first pattern (Issue openai#677 compliant):
   - Process val data in sequential chunks (TTT_CHUNK_TOKENS=131072)
   - Phase 1: score chunk under inference_mode (forward only)
   - Phase 2: train on scored tokens with AdamW (K epochs)
   - Each token scored BEFORE model trains on it

2. Switch TTT optimizer from SGD to AdamW (lr=0.0001, wd=0.0)
   - PR openai#700 showed AdamW >> SGD for TTT
   - Default 4 epochs, freeze first 2 blocks

3. Fix DDP find_unused_parameters → static_graph=True
   - Same 3x slowdown fix as submission directory

4. TTT defaults: disabled by default (TTT_ENABLED=0)
   - Enable with TTT_ENABLED=1 for TTT+n-gram combined eval

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>